| Package | Version | Purpose |
|---|---|---|
| ggplot2 | >=3.4.0 | Data visualization |
| dplyr | >=1.1.0 | Data manipulation |
| tidyr | >=1.3.0 | Data tidying |
| readr | >=2.1.0 | File I/O |
| purrr | >=1.0.0 | Functional programming |
| stringr | >=1.5.0 | String manipulation |
| tibble | >=3.2.0 | Modern data frames |
| patchwork | >=1.1.0 | Plot composition |
| here | >=1.0.0 | Project-relative paths |
| scales | >=1.2.0 | Axis formatting |
| randomForest | >=4.7 | Classification model |
| knitr | >=1.40 | Report generation |
| kableExtra | >=1.3.0 | Table formatting |
Single-Sample cfDNA WGBS Analysis Report
cfDNA WGBS Analysis and Prediction
1 Introduction
This report presents an representative analysis of a single cell-free DNA (cfDNA) sample from whole-genome bisulfite sequencing (WGBS) data. The analysis follows the same algorithms and methods used in the ALS vs Control study to ensure reproducibility and comparability.
1.1 Analysis Overview
The pipeline generates the following outputs:
- Quality Control Metrics: Mapping statistics, bisulfite conversion efficiency, and coverage metrics
- Fragment Size Analysis: Insert size distributions showing the characteristic cfDNA nucleosome pattern
- Fragmentation Ratio Track: Genome-wide short/long fragment ratio visualization
- Genomic Distribution: Distribution of fragment 5’ start sites across genomic features
- TSS Enrichment: Nucleosome positioning signal around transcription start sites
- End Motif Analysis: 5’ end 4-mer motif frequencies reflecting nuclease preferences
- Classification Prediction: Predicted group (ALS vs Ctrl) using pre-trained Random Forest model
1.2 Input Files
This notebook requires two input files:
- Bismark-aligned BAM file: Deduplicated BAM with XM methylation tags
- Bismark coverage file: Output from
bismark_methylation_extractor(.bismark.cov.gz)
If you only have a Bismark-aligned BAM file, run the following command to extract methylation calls:
For detailed documentation, see: Bismark Methylation Extraction
2 Dependencies
This section lists all software dependencies required to run this analysis.
2.1 R Packages
2.1.1 CRAN Packages
2.1.2 Bioconductor Packages
| Package | Version | Purpose |
|---|---|---|
| Rsamtools | >=2.14.0 | BAM file handling |
| cigarillo | >=1.0.0 | CIGAR string parsing |
| GenomicRanges | >=1.50.0 | Genomic intervals |
| IRanges | >=2.32.0 | Integer ranges |
| S4Vectors | >=0.36.0 | S4 vector infrastructure |
| Biostrings | >=2.66.0 | DNA sequence handling |
| BSgenome.Hsapiens.UCSC.hg38 | >=1.4.0 | Human reference genome (hg38) |
| GenomeInfoDb | >=1.34.0 | Genome metadata |
| TxDb.Hsapiens.UCSC.hg38.knownGene | >=3.16.0 | Gene annotations (hg38) |
| GenomicFeatures | >=1.50.0 | Genomic features |
| ChIPseeker | >=1.34.0 | Peak annotation |
| org.Hs.eg.db | >=3.16.0 | Human gene IDs |
| AnnotationDbi | >=1.60.0 | Annotation infrastructure |
| bsseq | >=1.34.0 | Bisulfite-seq analysis |
| Gviz | >=1.42.0 | Genomic track visualization |
2.2 External Tools
| Tool | Version | Purpose | Link |
|---|---|---|---|
| Bismark | >=0.24.0 | Bisulfite read alignment & methylation extraction | [GitHub](ht |
| samtools | >=1.17 | BAM file manipulation (indexing) | [GitHub](ht |
| Quarto | >=1.3.0 | Document rendering | [quarto.org |
# CRAN packages
install.packages(c("ggplot2", "dplyr", "tidyr", "readr", "purrr",
"stringr", "tibble", "patchwork", "here", "scales",
"randomForest", "knitr", "kableExtra"))
# Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c(
"Rsamtools", "cigarillo", "GenomicRanges", "IRanges", "S4Vectors",
"Biostrings", "BSgenome.Hsapiens.UCSC.hg38", "GenomeInfoDb",
"TxDb.Hsapiens.UCSC.hg38.knownGene", "GenomicFeatures",
"ChIPseeker", "org.Hs.eg.db", "AnnotationDbi",
"bsseq", "Gviz"
))3 Setup and Input Validation
Show code
# Load required packages
suppressPackageStartupMessages({
# CRAN packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(stringr)
library(tibble)
library(patchwork)
library(here)
library(knitr)
library(kableExtra)
library(scales)
library(randomForest)
# Bioconductor packages
library(Rsamtools)
library(cigarillo)
library(GenomicRanges)
library(IRanges)
library(S4Vectors)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg38)
library(GenomeInfoDb)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(GenomicFeatures)
library(ChIPseeker)
library(org.Hs.eg.db)
library(AnnotationDbi)
library(bsseq)
library(Gviz)
})
# Load analysis functions (from same directory as this notebook)
# The functions file should be in the same directory as this .qmd file
source("cfDNA_analysis_functions.R")
# Set seed for reproducibility
set.seed(389)Show code
# Parameters from YAML
BAM_FILE <- params$bam_file
BISMARK_COV_FILE <- params$bismark_cov_file
SAMPLE_ID <- params$sample_id
OUTPUT_DIR <- params$output_dir
PROJECT_DIR <- params$project_dir
# Create output directory
if (!dir.exists(OUTPUT_DIR)) {
dir.create(OUTPUT_DIR, recursive = TRUE, showWarnings = FALSE)
}
# Temporary directory for intermediate files
TEMP_DIR <- file.path(OUTPUT_DIR, "temp")
if (!dir.exists(TEMP_DIR)) {
dir.create(TEMP_DIR, recursive = TRUE, showWarnings = FALSE)
}
# Load reference data
genome <- BSgenome.Hsapiens.UCSC.hg38
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# Path to pre-trained model (from cohort analysis)
MODEL_PATH <- file.path(PROJECT_DIR, "data", "processed", "methylation", "rds", "nested_cv_results.rds")
STACKHMM_PATH <- file.path(PROJECT_DIR, "data", "processed", "hg38_genome_100_segments.bed")Show code
# Validate input files
validation_results <- tibble::tibble(
File = c("BAM file", "Bismark coverage file", "Pre-trained model", "stackHMM annotations"),
Path = c(BAM_FILE, BISMARK_COV_FILE, MODEL_PATH, STACKHMM_PATH),
Exists = c(file.exists(BAM_FILE), file.exists(BISMARK_COV_FILE),
file.exists(MODEL_PATH), file.exists(STACKHMM_PATH))
)
knitr::kable(validation_results, caption = "Input file validation") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
kableExtra::row_spec(which(!validation_results$Exists), bold = TRUE, color = "white", background = "#E64B35")| File | Path | Exists |
|---|---|---|
| BAM file | /Users/qyan/Library/CloudStorage/GoogleDrive-qyan@altoslabs.com/My Drive/Documents/Job/PrimaMente_Bioinfo/Interview/Take_home_task/cfDNA_WGBS_ALS_GSE164600/data/raw/SRR13404367.deduplicated.sorted_ds10mill_chr21.bam | TRUE |
| Bismark coverage file | /Users/qyan/Library/CloudStorage/GoogleDrive-qyan@altoslabs.com/My Drive/Documents/Job/PrimaMente_Bioinfo/Interview/Take_home_task/cfDNA_WGBS_ALS_GSE164600/data/processed/methylation_extractor/SRR13404367.deduplicated.sorted_ds10mill_chr21.name_sorted/SRR13404367.deduplicated.sorted_ds10mill_chr21.name_sorted.bismark.cov.gz | TRUE |
| Pre-trained model | ../data/processed/methylation/rds/nested_cv_results.rds | TRUE |
| stackHMM annotations | ../data/processed/hg38_genome_100_segments.bed | TRUE |
4 QC Metrics
This section extracts quality control metrics from the BAM file, including mapping statistics, fragment size distribution, and methylation metrics.
Show code
# Extract BAM metrics (this generates the fragment BED file as well)
message("Extracting BAM metrics...")
metrics <- extract_bam_metrics(
bam_path = BAM_FILE,
sample_id = SAMPLE_ID,
genome = genome,
chunk_size = 1e6,
frag_dir = TEMP_DIR
)
# Calculate summary statistics
summary_stats <- calculate_summary_stats(metrics)Show code
# Format key metrics for display
qc_display <- tibble::tibble(
Metric = c(
"Total Reads",
"Mapped Reads",
"Mapping Rate",
"Properly Paired",
"Proper Pair Rate",
"Duplicates",
"Duplicate Rate",
"Filtered Reads (MAPQ≥30)",
"Filter Pass Rate",
"Mean MAPQ",
"Number of Fragments",
"Mean Fragment Length",
"Median Fragment Length",
"Mode Fragment Length",
"GC Content",
"CpG Methylation",
"Bisulfite Conversion"
),
Value = c(
format(summary_stats$total_reads, big.mark = ","),
format(summary_stats$mapped_reads, big.mark = ","),
sprintf("%.2f%%", summary_stats$mapping_rate * 100),
format(summary_stats$properly_paired, big.mark = ","),
sprintf("%.2f%%", summary_stats$proper_pair_rate * 100),
format(summary_stats$duplicates, big.mark = ","),
sprintf("%.2f%%", summary_stats$duplicate_rate * 100),
format(summary_stats$filtered_reads, big.mark = ","),
sprintf("%.2f%%", summary_stats$filter_pass_rate * 100),
sprintf("%.1f", summary_stats$mean_mapq),
format(summary_stats$n_fragments, big.mark = ","),
sprintf("%.1f bp", summary_stats$mean_frag_length),
sprintf("%.0f bp", summary_stats$median_frag_length),
sprintf("%.0f bp", summary_stats$frag_mode),
sprintf("%.2f%%", summary_stats$gc_content * 100),
sprintf("%.2f%%", summary_stats$cpg_methylation * 100),
sprintf("%.2f%%", summary_stats$bisulfite_conversion * 100)
)
)
knitr::kable(qc_display, caption = paste("QC Summary for", SAMPLE_ID)) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
kableExtra::pack_rows("Alignment Statistics", 1, 9) %>%
kableExtra::pack_rows("Fragment Statistics", 10, 14) %>%
kableExtra::pack_rows("Methylation Statistics", 15, 17)| Metric | Value |
|---|---|
| Alignment Statistics | |
| Total Reads | 128,966 |
| Mapped Reads | 128,966 |
| Mapping Rate | 100.00% |
| Properly Paired | 128,966 |
| Proper Pair Rate | 100.00% |
| Duplicates | 0 |
| Duplicate Rate | 0.00% |
| Filtered Reads (MAPQ≥30) | 111,610 |
| Filter Pass Rate | 86.54% |
| Fragment Statistics | |
| Mean MAPQ | 38.1 |
| Number of Fragments | 55,805 |
| Mean Fragment Length | 174.4 bp |
| Median Fragment Length | 162 bp |
| Mode Fragment Length | 160 bp |
| Methylation Statistics | |
| GC Content | 38.49% |
| CpG Methylation | 82.71% |
| Bisulfite Conversion | 99.53% |
4.1 QC Metrics Visualization
Show code
# Prepare data for bar chart
qc_plot_data <- tibble::tibble(
metric = c("Total Reads (M)", "Filtered Reads (M)", "Mean MAPQ",
"Median Fragment (bp)", "GC Content (%)",
"CpG Methylation (%)", "Bisulfite Conv. (%)"),
value = c(
summary_stats$total_reads / 1e6,
summary_stats$filtered_reads / 1e6,
summary_stats$mean_mapq,
summary_stats$median_frag_length,
summary_stats$gc_content * 100,
summary_stats$cpg_methylation * 100,
summary_stats$bisulfite_conversion * 100
),
category = c("Reads", "Reads", "Quality", "Fragment", "Fragment", "Methylation", "Methylation")
) %>%
dplyr::mutate(
metric = factor(metric, levels = rev(metric)),
label = dplyr::case_when(
grepl("Reads", metric) ~ sprintf("%.1fM", value),
grepl("MAPQ", metric) ~ sprintf("%.1f", value),
grepl("Fragment", metric) & !grepl("%", metric) ~ sprintf("%.0f bp", value),
TRUE ~ sprintf("%.1f%%", value)
)
)
# Create horizontal bar chart
p_qc <- ggplot(qc_plot_data, aes(x = value, y = metric, fill = category)) +
geom_col(width = 0.7, alpha = 0.9) +
geom_text(aes(label = label), hjust = -0.1, size = 3.5, fontface = "bold") +
scale_fill_brewer(palette = "Set2") +
scale_x_continuous(expand = expansion(mult = c(0, 0.25))) +
labs(
title = paste("QC Metrics:", SAMPLE_ID),
subtitle = "Key quality indicators from BAM file analysis",
x = "Value",
y = NULL,
fill = "Category"
) +
theme_pub(base_size = 12) +
theme(
legend.position = "top",
panel.grid.major.y = element_blank()
)
p_qc5 Fragmentation Analysis
Cell-free DNA exhibits characteristic insert size patterns reflecting nucleosome protection. This section analyzes the insert size distribution and genome-wide short/long fragment ratio.
5.1 Insert Size Distribution
Show code
# Prepare fragment length data
frag_df <- tibble::tibble(
fragment_length = metrics$fragment_lengths
) %>%
dplyr::filter(fragment_length >= 80, fragment_length <= 450)
# Count fragments by length
counts_saw <- frag_df %>%
dplyr::count(fragment_length, name = "n")
# Calculate mode for annotation
calc_mode <- function(x) {
x <- x[is.finite(x)]
if (length(x) == 0) return(NA_integer_)
x <- as.integer(x)
tabulate(x, nbins = max(x)) |> which.max()
}
mode_bp <- calc_mode(metrics$fragment_lengths)
median_bp <- median(metrics$fragment_lengths, na.rm = TRUE)
short_n <- sum(metrics$fragment_lengths >= 100 & metrics$fragment_lengths <= 150)
long_n <- sum(metrics$fragment_lengths >= 151 & metrics$fragment_lengths <= 220)
short_long_ratio <- if (long_n > 0) short_n / long_n else NA
# Create sawtooth plot
p_sawtooth <- ggplot(counts_saw, aes(x = fragment_length, y = n)) +
geom_line(linewidth = 0.5, color = "gray20") +
geom_vline(xintercept = 167, linetype = "dashed", color = "#E64B35", linewidth = 0.6) +
geom_vline(xintercept = 334, linetype = "dashed", color = "#4DBBD5", linewidth = 0.6) +
annotate("text", x = 175, y = max(counts_saw$n) * 0.95,
label = "Mono-nucleosome\n(~167 bp)", hjust = 0, size = 3, color = "#E64B35") +
annotate("text", x = 342, y = max(counts_saw$n) * 0.5,
label = "Di-nucleosome\n(~334 bp)", hjust = 0, size = 3, color = "#4DBBD5") +
annotate("label", x = 380, y = max(counts_saw$n) * 0.85,
label = sprintf("Median: %d bp\nMode: %d bp\nS/L ratio: %.3f",
as.integer(median_bp), mode_bp, short_long_ratio),
hjust = 0, size = 3, fill = "white", label.size = 0.3) +
labs(
title = "cfDNA Fragment Size Distribution",
subtitle = paste("Sample:", SAMPLE_ID),
x = "Fragment Length (bp)",
y = "Count"
) +
theme_pub(base_size = 12)
p_sawtooth5.2 Fragmentation Ratio Track
The short/long fragment ratio varies across the genome and can reveal nucleosome positioning differences.
Show code
# Set up genomic bins (chr21 only for efficiency)
bin_size <- 2e6
chroms <- "chr21"
seqlens <- GenomeInfoDb::seqlengths(genome)[chroms]
seqlens <- seqlens[!is.na(seqlens)]
bins_gr <- GenomicRanges::tileGenome(
seqlengths = seqlens,
tilewidth = bin_size,
cut.last.tile.in.chrom = TRUE
)
bins_df <- tibble::tibble(
bin_id = seq_along(bins_gr),
chrom = as.character(GenomeInfoDb::seqnames(bins_gr)),
start = start(bins_gr),
end = end(bins_gr),
mid = as.integer(floor((start + end) / 2))
)
# GC content per bin
bin_seqs <- Biostrings::getSeq(genome, bins_gr)
freq <- Biostrings::letterFrequency(bin_seqs, letters = c("A", "C", "G", "T", "N"), as.prob = FALSE)
acgt <- rowSums(freq[, c("A", "C", "G", "T"), drop = FALSE])
gc <- (freq[, "G"] + freq[, "C"]) / pmax(acgt, 1)
n_frac <- freq[, "N"] / Biostrings::width(bin_seqs)
bins_df <- bins_df %>%
dplyr::mutate(gc = as.numeric(gc), n_frac = as.numeric(n_frac))
# GC filtering thresholds
gc_min <- 0.10
gc_max <- 0.85
n_frac_max <- 0.45
bins_keep_for_gc <- is.finite(bins_df$gc) &
bins_df$gc >= gc_min & bins_df$gc <= gc_max &
is.finite(bins_df$n_frac) & bins_df$n_frac <= n_frac_max
# Compute bin counts
fragment_bed_path <- file.path(TEMP_DIR, paste0(SAMPLE_ID, ".fragments.bed.gz"))
bin_tracks <- bin_counts_one_sample(
path = fragment_bed_path,
bins_gr = bins_gr,
bins_df = bins_df,
bins_keep_for_gc = bins_keep_for_gc
)
bin_tracks <- bin_tracks %>%
dplyr::left_join(bins_df, by = "bin_id")Show code
# Plot fragmentation ratio track
p_frag_track <- ggplot(bin_tracks, aes(x = mid / 1e6, y = short_long_ratio)) +
geom_hline(yintercept = 1, color = "gray70", linewidth = 0.35) +
geom_line(linewidth = 0.8, color = "#2E4A62") +
geom_point(size = 1.5, color = "#2E4A62", alpha = 0.7) +
scale_x_continuous(breaks = seq(0, 50, by = 10)) +
labs(
title = paste("Fragmentation Ratio Track:", SAMPLE_ID),
subtitle = sprintf("GC-corrected short(90-150bp)/long(151-220bp) ratio in %d Mb bins on chr21",
bin_size / 1e6),
x = "Genomic Position (Mb)",
y = "Short/Long Ratio"
) +
theme_pub(base_size = 12)
p_frag_trackShow code
# Gviz visualization
if (requireNamespace("Gviz", quietly = TRUE)) {
genome_build <- "hg38"
chrom <- "chr21"
from_bp <- min(bins_df$start, na.rm = TRUE)
to_bp <- max(bins_df$end, na.rm = TRUE)
gr_bins <- GenomicRanges::GRanges(
seqnames = bins_df$chrom,
ranges = IRanges::IRanges(start = bins_df$start, end = bins_df$end)
)
y <- bin_tracks$short_long_ratio
y_lim <- range(y[is.finite(y)], na.rm = TRUE)
y_pad <- 0.15 * diff(y_lim)
y_lim <- c(y_lim[1] - y_pad, y_lim[2] + y_pad)
axis_track <- Gviz::GenomeAxisTrack(genome = genome_build, chromosome = chrom, name = "")
ratio_track <- Gviz::DataTrack(
range = gr_bins,
data = y,
genome = genome_build,
chromosome = chrom,
name = "S/L ratio",
type = "l",
col = "#2E4A62",
lwd = 2,
ylim = y_lim,
baseline = 1,
col.baseline = "gray45",
lty.baseline = 2,
lwd.baseline = 1,
cex.axis = 0.9,
cex.title = 0.9
)
ideo_track <- tryCatch(
Gviz::IdeogramTrack(genome = genome_build, chromosome = chrom),
error = function(e) NULL
)
tracks <- list(axis_track, ratio_track)
if (!is.null(ideo_track)) tracks <- c(list(ideo_track), tracks)
Gviz::plotTracks(
tracks,
from = from_bp,
to = to_bp,
background.title = "white",
col.title = "black",
cex.title = 0.95,
background.panel = "white",
col.axis = "black",
col.frame = "gray85"
)
}6 Fragment Start Site Analysis
This section analyzes the genomic distribution of fragment 5’ start sites and their enrichment around transcription start sites.
6.1 Genomic Distribution
Show code
>> preparing features information... 2026-01-30 15:39:18
>> Using Genome: hg38 ...
>> identifying nearest features... 2026-01-30 15:39:20
>> calculating distance from peak to TSS... 2026-01-30 15:39:21
>> assigning genomic annotation... 2026-01-30 15:39:21
>> Using Genome: hg38 ...
>> Using Genome: hg38 ...
>> adding flank feature information from peaks... 2026-01-30 15:40:02
>> assigning chromosome lengths 2026-01-30 15:40:07
>> done... 2026-01-30 15:40:07
Show code
# Prepare data for donut chart
annot_data <- start_annot %>%
tidyr::pivot_longer(
cols = c(Promoter, Exon, Intron, Three_UTR, Five_UTR, Distal_Intergenic, Downstream),
names_to = "annotation",
values_to = "n"
) %>%
dplyr::mutate(
pct = 100 * n / sum(n),
annotation = dplyr::case_when(
annotation == "Three_UTR" ~ "3' UTR",
annotation == "Five_UTR" ~ "5' UTR",
annotation == "Distal_Intergenic" ~ "Distal Intergenic",
TRUE ~ annotation
),
annotation = factor(annotation, levels = c("Promoter", "5' UTR", "Exon", "Intron",
"3' UTR", "Downstream", "Distal Intergenic"))
) %>%
dplyr::arrange(annotation) %>%
dplyr::mutate(
ymax = cumsum(pct),
ymin = dplyr::lag(ymax, default = 0),
label_pos = (ymin + ymax) / 2,
label = sprintf("%s\n%.1f%%", annotation, pct)
)
# Create donut chart
p_donut <- ggplot(annot_data, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 2.5, fill = annotation)) +
geom_rect(color = "white", linewidth = 0.5) +
geom_text(aes(x = 3.25, y = label_pos, label = sprintf("%.1f%%", pct)),
size = 3, color = "white", fontface = "bold") +
coord_polar(theta = "y") +
xlim(c(1, 4)) +
scale_fill_brewer(palette = "Set2") +
annotate("text", x = 1, y = 0, label = paste("Total\n", format(start_annot$total_starts, big.mark = ",")),
size = 4, fontface = "bold") +
labs(
title = "Genomic Distribution of Fragment 5' Starts",
subtitle = paste("Sample:", SAMPLE_ID),
fill = "Annotation"
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "right"
)
p_donut6.2 TSS Enrichment
Show code
# Get muscle-related genes on chr21
muscle_symbols <- c("COL6A1", "COL6A2")
muscle_map <- AnnotationDbi::select(
org.Hs.eg.db::org.Hs.eg.db,
keys = muscle_symbols,
keytype = "SYMBOL",
columns = c("SYMBOL", "ENTREZID")
) %>%
dplyr::filter(!is.na(.data$ENTREZID)) %>%
dplyr::mutate(ENTREZID = as.character(.data$ENTREZID))
genes_chr21 <- suppressMessages(GenomicFeatures::genes(txdb))
genes_chr21 <- genes_chr21[as.character(GenomeInfoDb::seqnames(genes_chr21)) == "chr21"]
genes_muscle <- genes_chr21[as.character(genes_chr21$gene_id) %in% unique(muscle_map$ENTREZID)]
# TSS windows for muscle genes
tss_points_m <- unique(GenomicRanges::promoters(genes_muscle, upstream = 0L, downstream = 1L))
tss_windows_m <- unique(GenomicRanges::promoters(genes_muscle, upstream = 2000L, downstream = 2000L))
tss_site_m <- start(tss_points_m)
tss_strand_m <- as.character(strand(tss_points_m))
# Compute TSS enrichment profile
tss_profile <- profile_tss_enrichment_starts(
fragment_bed_gz = fragment_bed_path,
tss_windows = tss_windows_m,
tss_site = tss_site_m,
tss_strand = tss_strand_m,
flank = 2000L
)Show code
# Smooth the profile
tss_profile_smooth <- tss_profile %>%
dplyr::mutate(
enrichment_smooth = moving_average(enrichment, k = 21),
enrichment_smooth = dplyr::if_else(is.na(enrichment_smooth), enrichment, enrichment_smooth)
)
p_tss <- ggplot(tss_profile_smooth, aes(x = rel_bp, y = enrichment_smooth)) +
geom_hline(yintercept = 1, color = "gray60", linewidth = 0.35) +
geom_vline(xintercept = 0, color = "gray30", linewidth = 0.5) +
geom_line(linewidth = 0.9, color = "#2E4A62") +
annotate("text", x = 50, y = max(tss_profile_smooth$enrichment_smooth, na.rm = TRUE) * 0.95,
label = "TSS", hjust = 0, size = 4, fontface = "bold") +
labs(
title = "TSS Enrichment of Fragment 5' Start Sites",
subtitle = paste("Muscle genes (COL6A1, COL6A2) on chr21 | Sample:", SAMPLE_ID),
x = "Distance to TSS (bp)",
y = "Normalized Fragment Density"
) +
theme_pub(base_size = 12)
p_tss7 End Motif Analysis
Cell-free DNA fragments exhibit characteristic 5’ end motif patterns reflecting the sequence preferences of nucleases involved in cfDNA generation.
Show code
# Extract 4-mer end motifs
motifs_all <- all_4mers()
valid_seqnames <- seqnames(genome)
motif_counts <- extract_4mer_counts_from_frag_bed(
frag_bed_gz = fragment_bed_path,
sample_id = SAMPLE_ID,
genome = genome,
valid_seqnames = valid_seqnames,
motifs_all = motifs_all
)
# Calculate frequencies
motif_freq <- motif_counts %>%
dplyr::mutate(
total = sum(count),
frequency = count / total
)7.1 Top 20 End Motifs
Show code
# Get top 20 motifs
top20_motifs <- motif_freq %>%
dplyr::arrange(desc(frequency)) %>%
dplyr::slice_head(n = 20) %>%
dplyr::mutate(motif = factor(motif, levels = rev(motif)))
p_top20 <- ggplot(top20_motifs, aes(x = frequency, y = motif)) +
geom_col(fill = "#3C5488", alpha = 0.9, width = 0.7) +
geom_text(aes(label = sprintf("%.4f", frequency)), hjust = -0.1, size = 3) +
scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Top 20 cfDNA 5' End Motifs (4-mer)",
subtitle = paste("Sample:", SAMPLE_ID),
x = "Frequency",
y = "Motif"
) +
theme_pub(base_size = 12) +
theme(panel.grid.major.y = element_blank())
p_top207.2 Motif Diversity Score
The Motif Diversity Score (MDS) is calculated as the normalized Shannon entropy of the 256 4-mer frequencies.
Show code
# Calculate MDS
freq_vec <- motif_freq$frequency
entropy_bits <- shannon_entropy_bits(freq_vec)
max_entropy <- log2(256) # Maximum possible entropy for 256 motifs
mds <- entropy_bits / max_entropy
mds_display <- tibble::tibble(
Metric = c("Shannon Entropy (bits)", "Maximum Entropy (bits)", "Motif Diversity Score (MDS)"),
Value = c(sprintf("%.4f", entropy_bits), sprintf("%.4f", max_entropy), sprintf("%.4f", mds))
)
knitr::kable(mds_display, caption = "End Motif Diversity Metrics") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Metric | Value |
|---|---|
| Shannon Entropy (bits) | 7.6745 |
| Maximum Entropy (bits) | 8.0000 |
| Motif Diversity Score (MDS) | 0.9593 |
Show code
# Create a simple gauge-like visualization for MDS
p_mds <- ggplot() +
geom_rect(aes(xmin = 0, xmax = 1, ymin = 0, ymax = 1), fill = "gray90") +
geom_rect(aes(xmin = 0, xmax = mds, ymin = 0, ymax = 1), fill = "#3C5488", alpha = 0.8) +
geom_text(aes(x = 0.5, y = 0.5, label = sprintf("MDS = %.4f", mds)),
size = 8, fontface = "bold", color = "white") +
geom_text(aes(x = 0.5, y = -0.3, label = "Shannon entropy of 256 4-mer frequencies (normalized)"),
size = 3.5, color = "gray40") +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
scale_y_continuous(limits = c(-0.5, 1.2)) +
labs(title = paste("Motif Diversity Score:", SAMPLE_ID)) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5, size = 14, margin = ggplot2::margin(b = 10)),
axis.text.x = element_text(color = "gray30"),
axis.ticks.x = element_line(color = "gray30")
)
p_mds8 Classification Prediction
This section uses the pre-trained Random Forest model to predict whether the sample belongs to the ALS or Control group based on methylation features.
Show code
# Check if model exists
model_available <- file.exists(MODEL_PATH) && file.exists(STACKHMM_PATH)
if (model_available) {
# Load pre-trained model
all_results <- readRDS(MODEL_PATH)
# Load stackHMM annotations
stackhmm_df <- readr::read_tsv(
STACKHMM_PATH,
col_names = c("chr", "start", "end", "state"),
col_types = "ciic",
progress = FALSE
) %>%
dplyr::filter(chr == "chr21")
stackhmm_gr <- GenomicRanges::GRanges(
seqnames = stackhmm_df$chr,
ranges = IRanges::IRanges(start = stackhmm_df$start + 1L, end = stackhmm_df$end),
state = stackhmm_df$state
)
states <- unique(stackhmm_df$state)
# Load methylation data from bismark coverage file
bs <- bsseq::read.bismark(
files = BISMARK_COV_FILE,
rmZeroCov = TRUE,
strandCollapse = TRUE,
verbose = FALSE
)
# Extract methylation values
cpg_gr <- GenomicRanges::granges(bs)
meth_mat <- bsseq::getMeth(bs, type = "raw", what = "perBase")
# Calculate stackHMM methylation features
stackhmm_meth <- numeric(length(states))
names(stackhmm_meth) <- states
for (i in seq_along(states)) {
cpg_idx <- S4Vectors::queryHits(
GenomicRanges::findOverlaps(cpg_gr, stackhmm_gr[stackhmm_gr$state == states[i]])
)
if (length(cpg_idx) > 0) {
stackhmm_meth[states[i]] <- median(meth_mat[cpg_idx, 1], na.rm = TRUE)
} else {
stackhmm_meth[states[i]] <- NA
}
}
# Prepare features for prediction (using stackHMM model)
model_result <- all_results[["stackHMM"]]
selected_features <- model_result$selected_features
# Create feature matrix
new_features <- matrix(stackhmm_meth[selected_features], nrow = 1,
dimnames = list(SAMPLE_ID, selected_features))
# Handle missing features
new_features[is.na(new_features)] <- 0
# Make prediction
prediction <- predict_single_sample(new_features, model_result)
prediction_available <- TRUE
} else {
prediction_available <- FALSE
message("Pre-trained model not available. Skipping classification prediction.")
}Show code
if (prediction_available) {
# Create prediction visualization
prob_data <- tibble::tibble(
Group = c("Ctrl", "ALS"),
Probability = c(pred_prob$Ctrl, pred_prob$ALS)
) %>%
dplyr::mutate(Group = factor(Group, levels = c("Ctrl", "ALS")))
# Create label for binary prediction
confidence <- max(pred_prob$Ctrl, pred_prob$ALS) * 100
prediction_label <- sprintf("Prediction: %s (%.1f%% confidence)", pred_class, confidence)
p_pred <- ggplot(prob_data, aes(x = Group, y = Probability, fill = Group)) +
geom_col(width = 0.6, alpha = 0.9) +
geom_text(aes(label = sprintf("%.1f%%", Probability * 100)),
vjust = -0.5, size = 5, fontface = "bold") +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "gray50") +
annotate("label", x = 1.5, y = 0.95, label = prediction_label,
size = 5, fontface = "bold", fill = pred_color, color = "white",
label.padding = unit(0.5, "lines"), label.r = unit(0.3, "lines")) +
scale_fill_manual(values = COLORS) +
scale_y_continuous(limits = c(0, 1.15), labels = scales::percent) +
labs(
title = "Classification Prediction Result",
subtitle = paste("Sample:", SAMPLE_ID, "| Model: stackHMM methylation features"),
x = NULL,
y = "Prediction Probability"
) +
theme_pub(base_size = 14) +
theme(legend.position = "none")
print(p_pred)
}Show code
if (prediction_available) {
pred_table <- tibble::tibble(
Metric = c("Predicted Class", "Ctrl Probability", "ALS Probability", "Confidence"),
Value = c(
pred_class,
sprintf("%.2f%%", pred_prob$Ctrl * 100),
sprintf("%.2f%%", pred_prob$ALS * 100),
sprintf("%.2f%%", max(pred_prob$Ctrl, pred_prob$ALS) * 100)
)
)
knitr::kable(pred_table, caption = "Classification Prediction Results") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
kableExtra::row_spec(1, bold = TRUE, color = "white", background = pred_color)
}| Metric | Value |
|---|---|
| Predicted Class | ALS |
| Ctrl Probability | 25.80% |
| ALS Probability | 74.20% |
| Confidence | 74.20% |
9 Summary
Show code
# Compile all key results
summary_results <- tibble::tibble(
Category = c(
rep("Quality Control", 5),
rep("Fragment Analysis", 4),
rep("End Motifs", 2),
if (prediction_available) "Classification" else character(0)
),
Metric = c(
"Total Reads", "Mapping Rate", "Bisulfite Conversion", "CpG Methylation", "Mean MAPQ",
"Median Fragment Length", "Mode Fragment Length", "Short/Long Ratio", "TSS Enrichment Score",
"Top Motif", "Motif Diversity (MDS)",
if (prediction_available) "Predicted Group" else character(0)
),
Value = c(
format(summary_stats$total_reads, big.mark = ","),
sprintf("%.2f%%", summary_stats$mapping_rate * 100),
sprintf("%.2f%%", summary_stats$bisulfite_conversion * 100),
sprintf("%.2f%%", summary_stats$cpg_methylation * 100),
sprintf("%.1f", summary_stats$mean_mapq),
sprintf("%d bp", as.integer(summary_stats$median_frag_length)),
sprintf("%d bp", summary_stats$frag_mode),
sprintf("%.3f", short_long_ratio),
sprintf("%.2f", max(tss_profile$enrichment, na.rm = TRUE)),
as.character(top20_motifs$motif[1]),
sprintf("%.4f", mds),
if (prediction_available) sprintf("%s (%.1f%%)", pred_class, max(pred_prob) * 100) else character(0)
)
)
knitr::kable(summary_results, caption = paste("Analysis Summary for", SAMPLE_ID)) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
kableExtra::pack_rows("Quality Control", 1, 5) %>%
kableExtra::pack_rows("Fragment Analysis", 6, 9) %>%
kableExtra::pack_rows("End Motifs", 10, 11) %>%
{if (prediction_available) kableExtra::pack_rows(., "Classification", 12, 12) else .}| Category | Metric | Value |
|---|---|---|
| Quality Control | ||
| Quality Control | Total Reads | 128,966 |
| Quality Control | Mapping Rate | 100.00% |
| Quality Control | Bisulfite Conversion | 99.53% |
| Quality Control | CpG Methylation | 82.71% |
| Quality Control | Mean MAPQ | 38.1 |
| Fragment Analysis | ||
| Fragment Analysis | Median Fragment Length | 162 bp |
| Fragment Analysis | Mode Fragment Length | 160 bp |
| Fragment Analysis | Short/Long Ratio | 0.438 |
| Fragment Analysis | TSS Enrichment Score | -Inf |
| End Motifs | ||
| End Motifs | Top Motif | AAAA |
| End Motifs | Motif Diversity (MDS) | 0.9593 |
| Classification | ||
| Classification | Predicted Group | ALS (74.2%) |
10 Methods
10.1 Data Processing
The BAM file was processed using custom R functions that implement the following pipeline:
- Read Filtering: Reads were filtered to retain properly paired, non-duplicate alignments with MAPQ ≥ 30.
- Fragment Extraction: Fragment coordinates were derived from paired-end alignments using the union span of mate alignments.
- GC Content: Fragment GC content was calculated as (G+C)/(A+C+G+T) excluding N bases.
- Methylation Extraction: CpG methylation was extracted from Bismark XM tags or coverage files.
10.2 Fragmentation Analysis
- Insert Size Distribution: Fragment lengths were computed from paired-end alignment coordinates.
- Short/Long Ratio: Ratio of fragments 90-150 bp (short) to 151-220 bp (long).
- GC Correction: LOESS regression was used to correct for GC bias in fragment counts.
10.3 End Motif Analysis
- Motif Extraction: 4-mer sequences were extracted from both 5’ ends of each fragment (strand-aware).
- Motif Diversity Score: Calculated as normalized Shannon entropy: MDS = H(p) / log₂(256).
10.4 Classification
- Model: Random Forest classifier trained on ALS vs Control cohort data.
- Features: stackHMM-based regional methylation levels (100 chromatin states).
- Validation: Nested cross-validation (LOOCV outer, 5-fold inner) with Boruta feature selection.
10.5 Software
This analysis was performed using R version R version 4.5.1 (2025-06-13) with Bioconductor packages for genomic analysis. All core algorithms are identical to those used in the cohort analysis to ensure reproducibility.
Report generated on 2026-01-30 16:19:22.425818